ABSTRACT


The  problem  of  determining  whether  a given  interval  of  a speech 
signal  should  be  classified  as  voiced  speech,  unvoiced  speech  or  silence 
is  formulated  as  a test  of  statistical  hypotheses.  A robust  detector  is 
obtained  by  modelling  the  speech  and  the  acoustic  background  noise  signals 
as  correlated  Gaussian  random  processes.  The  methods  of  statistical  decision 
theory are applied to these models to synthesize an optimum, minimum probability
of  error,  classifier. 

The optimum classifier is an estimator-correlator receiver which is
well approximated using a linear phase high pass filter in the unvoiced channel
and a linear phase low pass filter in the voiced channel. A clutter filter
appears in the reference channel which tries to eliminate as much noise as
possible before forming the unvoiced and voiced correlations. The statistics
of the noise are learned during the silent intervals, which makes the classifier
adaptive to time-varying noise statistics.

Knowledge of the clutter correlation function permits implementation
of adaptive Wiener filters which are used to eliminate as much noise as possible
prior to the determination of pitch and the estimation of the LPC filter
coefficients. The clutter filtered voiced speech signal is then passed through
a bank of comb filters and the pitch estimate chosen to correspond to the filter
for which the output energy is largest. It is shown that this pitch estimation
strategy is optimum and robust as long as the correlation time of the noise
is less than the minimum pitch period of interest.



The robust LPC vocoder is evaluated experimentally for Airborne
Command Post noise for which the unvoiced speech signal-to-noise ratio
is often less than 0 dB. Based on listening tests comparing the input speech
plus noise versus standard LPC synthesis techniques versus the robust LPC
vocoder, it is concluded that rather dramatic improvements in speech
intelligibility can be obtained at the expense of a marginal increase in computation.
I. INTRODUCTION AND SUMMARY

There are a variety of applications in which it is necessary to be
able to classify a given set of speech data as corresponding to voiced speech,
unvoiced speech or silence. For the synthesis of speech using Linear Predictive
Coding (LPC) techniques [1-4], for example, it is necessary that the speech signal
be classified as voiced or unvoiced. This information is transmitted to the
speech synthesizer along with coefficients that represent an all-pole linear
filter model for the vocal tract. For voiced speech the filter is excited
by a periodic train of impulses, whereas a white noise excitation is used
when unvoiced speech is to be synthesized.

The ability to detect silence is of interest in digital communications
in which channel capacity is at a premium. By detecting intervals of silence,
other data streams can be interleaved with the speech conversation thereby
maximizing the utilization of the available bandwidth. Another application of
silence detection arises in conferencing situations. By detecting when a set
of speakers are silent, their lines can be disconnected from the superposition
of inputs so that an enhancement of synthesizer input signal-to-noise ratio
can be obtained.

Solutions to the classification problem have, for the most part, been
developed on an ad hoc basis in which an individual discriminant is proposed
which seems to characterize in one way or another the attributes of the three
possible speech events. In a recent paper, Atal and Rabiner have proposed
an algorithm that simultaneously computes five of the most significant
discriminants and uses a hypothesis testing strategy to assign a given set of
observations to one of the three speech classes.
With few exceptions, most notably the work of Atal and
Rabiner, most of the speech research reported to date has dealt
with a speech environment that has been carefully controlled in the
sense that background noise and interference signals have been eliminated
from the speech. It is generally known that the intelligibility of modern
vocoders is seriously degraded when noise and interference signals are
superimposed on the speech data. Since there are many practical problems in which
noise and interference arise, it is of interest to develop more general speech
processing techniques designed to eliminate the noise as much as possible.

In  this  paper  it  is  assumed  that  the  speech  signals  are  corrupted 
by  additive  Gaussian  noise  that  may  or  may  not  be  white.  The  unvoiced  speech 
signal  is  modelled  as  a zero  mean  Gaussian  random  process  having  a known 
covariance  function.  Voiced  speech  is  modelled  as  a zero  mean  Gaussian 
quasi-periodic  random  process.  Using  these  models  as  a starting 
point  the  classification  problem  is  formulated  as  a statistical 
hypothesis test and solved using statistical decision theory.

Subject  to  the  validity  of  the  underlying  speech  models,  the 
resulting signal processing algorithm is optimum in the sense that the
probability  of  a decision  error  is  minimized.  The  advantage  of  this  approach 
is  that  the  discrimination  criteria  are  synthesized  from  the  model,  rather 
than  being  selected  on  an  ad  hoc  basis. 

The classification problem is recognized as a Gauss-in-Gauss
detection problem for which solutions have been catalogued by Van Trees.
The estimator-correlator structure was chosen since it led most naturally to a
practical implementation. If pitch information is available, additional
discrimination can be provided in the voiced speech channel using a comb
filter tuned to the most recent estimate of the pitch.

The  ability  to  detect  the  silent  intervals  (noise  alone)  means  that 
the statistics of the clutter can be learned and used to implement adaptive
Wiener  filters  to  enhance  the  speech  signals  prior  to  coding.  In  this  mode 
the  adaptive  prefilter  can  be  used  as  a pre-processor  for  any  narrowband  or 
wideband  speech  encoder. 

An extensive experimental program was developed to evaluate the classifier
in a variety of acoustic noise environments including shipboard noise,
office  noise,  helicopter  noise  and  noise  in  an  airborne  command  post.  The 
results  for  airborne  command  post  noise  are  included  in  this  paper. 

II. MODELS FOR SILENCE, UNVOICED AND VOICED SPEECH

The  basic  problem  of  detecting  the  presence  of  silence,  unvoiced  speech 
or  voiced  speech  in  a given  set  of  data  can  be  formulated  as  a statistical 
test  for  choosing  one  of  the  three  hypotheses: 

$$
\begin{aligned}
H_1 &: \text{silence:} & y(n) &= w(n) \\
H_2 &: \text{unvoiced:} & y(n) &= u(n) + w(n) \\
H_3 &: \text{voiced:} & y(n) &= v(n) + w(n)
\end{aligned}
\tag{2.1}
$$

where w(n), u(n) and v(n) represent the nth sample of noise, unvoiced speech
and voiced speech waveforms respectively. Based on a set of observations
y(1), y(2), ..., y(N) it is desired to develop a decision rule for determining
which of the three hypotheses "best" characterizes the data set. This is
the classification problem. In order to synthesize an optimum decision rule
in the sense that a classification is made with minimum probability of error,
it is necessary to develop statistical models that characterize the data for
each of the three speech events.

To  begin  with,  the  interference  will  be  assumed  to  consist  of  simply 
zero  mean  white  Gaussian  noise.  Once  the  detector  structure  has  been  analyzed 
and  understood  for  this  case  the  generalization  to  non-white  noise  spectra 
follows  almost  by  inspection. 

In order to derive the structure of the classifier it suffices
to model the unvoiced and voiced speech waveforms as sample functions of Gaussian
random processes having zero means and covariance functions $R_u(k)$ and $R_v(k)$
respectively. In addition voiced speech is assumed to be quasi-periodic in
the sense that $R_v(k+T) = R_v(k)$ where $T$ is the period of the process. This
means that almost every sample function is periodic with period $T$.

The preceding discussion can be summarized succinctly by the following
set of modelling equations. Under hypothesis $H_i$ the observed data set
is given by

$$y(n) = s_i(n) + w(n), \qquad i = 1, 2, 3 \tag{2.2}$$

where $s_1(n) = 0$ for silence, $s_2(n)$ is a Gaussian random process with mean zero
and covariance $R_u(k)$ for unvoiced speech and $s_3(n)$ is a zero mean
quasi-periodic Gaussian random process with covariance function $R_v(k)$ for voiced
speech. In all cases the noise term $w(n)$ represents a zero mean Gaussian
white noise random process having the correlation function $R_w(k) = \sigma^2 \delta(k)$.
III. THE OPTIMUM CLASSIFIER AGAINST WHITE NOISE

The optimum classifier processes the raw speech data y(1), y(2), ..., y(N)
in such a way that a decision is made with minimum probability of error
on whether the given interval of signal should be classified as voiced speech,
unvoiced speech or silence. Using statistical decision theory the minimum
probability of error decision rule is:

"Declare hypothesis $H_i$ to be true if and only if the a posteriori
probability that $H_i$ is true conditioned on the observation set
y(1), y(2), ..., y(N) is largest," i.e., choose the index $i$ that maximizes

$$p[H_i \mid y(N), \ldots, y(1)], \qquad i = 1, 2, 3$$

Signal processing configurations of the likelihood ratio test have been
documented by Van Trees. For the special case of ternary hypotheses, zero means
and stationary random processes the test is implemented by computing three
sufficient statistics denoted by $\ell_i$, $i = 1, 2, 3$. The first component of the $i$th
statistic is

$$\ell_{y_i} = \sum_{n=1}^{N} y(n)\, \hat{s}_i(n), \qquad i = 1, 2, 3 \tag{3.1}$$

where $\hat{s}_i(n)$ is the linear least squares unrealizable estimate of the $i$th
signal $s_i(n)$. The bias component of the $i$th sufficient statistic is

$$\ell_{B_i} = -\frac{T}{2} \int \ln\left[ 1 + \frac{G_i(f)}{N_0/2} \right] df, \qquad i = 1, 2, 3 \tag{3.2}$$

where $T = N/F_s$ is the observation time of the process, $F_s$ is the sampling rate,
$G_i(f)$ is the power spectrum of the $i$th random process and $N_0/2$ is the two-sided
white noise spectral density. The complete $i$th sufficient statistic is
$$\ell_i = \ell_{y_i} + \ell_{B_i}, \qquad i = 1, 2, 3 \tag{3.3}$$

and the test consists of choosing the largest of

$$\ell_i + \ln P_i, \qquad i = 1, 2, 3 \tag{3.4}$$

where $P_i$ is the a priori probability that hypothesis $H_i$ is true. The goal
now is to use the gross attributes of speech signals to simplify the computations
involved in implementing the likelihood ratio test.

Under hypothesis $H_1$, which corresponds to silence, the anticipated
signal is $s_1(n) = 0$. Therefore $\hat{s}_1(n) = 0$ whence $\ell_{y_1} = 0$, $\ell_{B_1} = 0$
and $\ell_1 + \ln P_1 = \ln P_1$.
The likelihood ratio test reduces to computing only two statistics

$$\ell_{y_2} + \ell_{B_2} + \ln P_2 - \ln P_1 \tag{3.5a}$$

$$\ell_{y_3} + \ell_{B_3} + \ln P_3 - \ln P_1 \tag{3.5b}$$


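in which only $\ell_{y_2}$ and $\ell_{y_3}$ involve the raw data, $\ell_{B_2}$ and $\ell_{B_3}$ being fixed biases
reflecting the average energy in the ensembles of unvoiced and voiced speech
sounds. Letting

$$\lambda_u = -\ell_{B_2} - \ln P_2 + \ln P_1 \tag{3.6a}$$

$$\lambda_v = -\ell_{B_3} - \ln P_3 + \ln P_1 \tag{3.6b}$$

$$\lambda_{uv} = \ell_{B_3} - \ell_{B_2} - \ln P_2 + \ln P_3 \tag{3.6c}$$

the classification rule reduces to the following:

(1) If: $\ell_{y_2} < \lambda_u$ and $\ell_{y_3} < \lambda_v$ (3.7a)
    declare silence

(2) If: $\ell_{y_2} > \lambda_u$ or $\ell_{y_3} > \lambda_v$, and $\ell_{y_2} - \ell_{y_3} > \lambda_{uv}$ (3.7b)
    declare unvoiced speech

(3) If: $\ell_{y_2} > \lambda_u$ or $\ell_{y_3} > \lambda_v$, and $\ell_{y_2} - \ell_{y_3} < \lambda_{uv}$ (3.7c)
    declare voiced speech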
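The three-way rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_frame` and its argument names are hypothetical, and the handling of exact ties (which the rule leaves unspecified) is an arbitrary choice made here.

```python
def classify_frame(l_u, l_v, lam_u, lam_v, lam_uv=0.0):
    """Apply the three-way rule to one frame.

    l_u, l_v      : unvoiced and voiced correlation statistics for the frame
    lam_u, lam_v  : silence thresholds for the two channels
    lam_uv        : voicing threshold applied to the difference l_u - l_v
    """
    # Neither channel exceeds its threshold: the frame is noise alone.
    if l_u < lam_u and l_v < lam_v:
        return "silence"
    # Speech is present; the sign of the margin selects the speech class.
    # Exact ties fall through to "voiced" (an arbitrary choice for this sketch).
    if l_u - l_v > lam_uv:
        return "unvoiced"
    return "voiced"
```

For example, `classify_frame(5.0, 1.5, 1.0, 1.0)` returns `"unvoiced"` because the unvoiced channel exceeds its threshold and dominates the voiced channel.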

In order to simplify the test further it is noted from (3.2) that
the bias terms $\ell_{B_2}$ and $\ell_{B_3}$ are related to the energy in the ensemble of unvoiced
speech and voiced speech sample functions. If a global average is taken, the
voiced speech spectrum will have significantly more energy than that of
unvoiced speech, which would contribute a negative bias in favour of the unvoiced
speech hypothesis. Using this bias would be valid if voiced speech were truly
stationary. In fact, however, not only do the spectral properties change from
frame to frame but, more importantly, the amplitude undergoes a slowly increasing
and decreasing modulation at the beginning and ending of a voiced sound. Since
10-20 msec frames of speech represent the data base upon which a classification
is to be made, then from a sample function point of view the energy in a frame
of unvoiced speech or a frame of voiced speech could be comparable. The
inclusion of the ensemble average energy bias term would therefore incorrectly
favour unvoiced speech. Therefore the bias terms $\ell_{B_2}$ and $\ell_{B_3}$ must be assumed
to be equal. Under this condition the thresholds reduce to

$$\lambda_u = -\ell_B - \ln P_2 + \ln P_1 \tag{3.8a}$$

$$\lambda_v = -\ell_B - \ln P_3 + \ln P_1 \tag{3.8b}$$

$$\lambda_{uv} = -\ln P_2 + \ln P_3 \tag{3.8c}$$

where $\ell_B = \ell_{B_2} = \ell_{B_3}$ represents an unknown bias term related to the a priori
knowledge of the energy in the unvoiced and voiced speech signals.

Although the Bayesian detection theory demands that the bias term
and a priori probabilities be calculated, a more practical method for determining
the thresholds would be to train the system against noise and then choose those
values that keep the false alarm rate at a value consistent with the system
objectives. For example a much greater penalty is paid for failing to detect
speech than falsely classifying noise as speech. Therefore the thresholds
most likely should be set close to the 1-sigma values of $\ell_{y_2}$ and $\ell_{y_3}$ obtained
during the noise training phase. This strategy is ideal for self-adaptive
tracking of the noise statistics should they be non-stationary. The voicing
threshold is most reasonably approximated by zero when the signal-to-noise
ratio is large or the noise is white. When this is not the case,
this threshold can also be trained to the 1-sigma value of $\ell_{y_2} - \ell_{y_3}$.
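The threshold training described above might be sketched as follows. The text does not spell out how the "1-sigma value" is computed, so this sketch assumes it means the mean of the statistic over noise-only frames plus one standard deviation; the function name `train_threshold` and the `n_sigma` parameter are hypothetical.

```python
import math
import statistics


def train_threshold(noise_statistics, n_sigma=1.0):
    """Return mean + n_sigma * standard deviation of a channel statistic.

    noise_statistics : values of the correlation statistic measured on frames
                       already classified as silence (noise alone).
    n_sigma          : number of standard deviations above the mean
                       (1.0 is the assumed reading of "1-sigma").
    """
    mean = statistics.mean(noise_statistics)
    sigma = statistics.pstdev(noise_statistics)  # population standard deviation
    return mean + n_sigma * sigma
```

Re-running this over the most recent silent frames would let the thresholds follow slowly varying noise, which is the self-adaptive behaviour the text refers to.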


As a result of the preceding analysis the only statistics that must
be calculated at each frame time are the correlations

$$\ell_{y_i} = \sum_{n=1}^{N} y(n)\, \hat{s}_i(n), \qquad i = 2, 3 \tag{3.9}$$

where $y(n)$ is the raw speech plus noise data and $\hat{s}_i(n)$ is the linear least
squares unrealizable estimate of $s_i(n)$ given that hypothesis $H_i$ is true. Since
the unvoiced and voiced speech waveforms are quasi-stationary the filter that
results in $\hat{s}_i(n)$ given that $y(n) = s_i(n) + w(n)$ has the transfer function

$$H_i(f) = \frac{G_i(f)}{G_i(f) + N_0/2} \tag{3.10}$$
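As a small numerical illustration of (3.9) and (3.10), the sketch below evaluates the Wiener gain on a grid of frequency bins and forms the correlation statistic from a signal estimate. The function names and the idea of sampling the spectra on discrete bins are assumptions made for this example only.

```python
def wiener_gain(signal_psd, noise_level):
    """Evaluate G(f) / (G(f) + N0/2) for each frequency bin.

    signal_psd  : list of signal power spectrum samples G(f)
    noise_level : the two-sided white noise density N0/2 (same units as G)
    """
    gains = []
    for g in signal_psd:
        total = g + noise_level
        gains.append(g / total if total > 0.0 else 0.0)
    return gains


def correlation_statistic(y, s_hat):
    """Sum over the frame of y(n) * s_hat(n)."""
    return sum(a * b for a, b in zip(y, s_hat))
```

With `noise_level = 1.0`, bins with signal power 0, 1 and 9 receive gains 0, 0.5 and 0.9, which shows how the filter passes only those frequencies where the signal dominates the noise.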

The  filters  defined  by  (3.10)  obtain  enhanced  discrimination  against 
noise  by  passing  only  those  frequencies  where  the  signal  power  is  substantially 
larger  than  the  noise  power.  Enhanced  voiced-unvoiced  discrimination  depends 
on  the  implicit  orthogonality  of  the  two  random  processes  as  reflected  by  the 
degree  to  which  the  spectral  densities  are  correlated.  Both  of  these  detection 
statistics  can  be  improved  by  capitalizing  on  the  quasi-periodic  properties 
of  voiced  speech.  If  the  voiced  speech  process  is  periodic  with  period  T then 
the  voiced  speech  power  spectrum  is  more  accurately  represented  by 

$$G_3(f) = C(f;T)\, G_v(f) \tag{3.11}$$

where $G_v(f)$ represents the gross properties of the spectral envelope and $C(f;T)$
is a comb filter reflecting the fine structure of the periodic spectrum. If
the period is maintained for $M$ periods then

$$C(f;T) = \frac{1}{M}\, \frac{\sin(\pi M f/F)}{\sin(\pi f/F)}\, \exp[\, j\pi (M-1) f/F \,] \tag{3.12}$$

where $F = 1/T$ represents the pitch frequency. Not only does the comb filter
enhance the voiced speech-to-noise ratio but it also increases the orthogonality
of the voiced and unvoiced spectra. In order to exploit the additional
discrimination implicit in the comb filter it is necessary that the pitch period
be known. A discussion of how the pitch is to be determined will be deferred
to a later section.
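The behaviour of an M-period comb can be checked numerically. The sketch below assumes the comb is the average of M copies of the input spaced one pitch period apart, and compares the direct sum of complex exponentials with the equivalent closed form (a ratio of sines times a linear phase term). The function names are hypothetical.

```python
import cmath
import math


def comb_response(f_hz, pitch_hz, m_periods):
    """Direct form: (1/M) * sum over m of exp(+j*2*pi*f*m/F)."""
    total = sum(cmath.exp(2j * math.pi * f_hz * m / pitch_hz)
                for m in range(m_periods))
    return total / m_periods


def comb_closed_form(f_hz, pitch_hz, m_periods):
    """Closed form: sin(pi*M*f/F) / (M*sin(pi*f/F)) * exp(j*pi*(M-1)*f/F)."""
    x = math.pi * f_hz / pitch_hz
    if abs(math.sin(x)) < 1e-12:
        # At a pitch harmonic the ratio is 0/0; its limiting value is 1.
        return complex(1.0, 0.0)
    magnitude = math.sin(m_periods * x) / (m_periods * math.sin(x))
    return magnitude * cmath.exp(1j * (m_periods - 1) * x)
```

For a 100 Hz pitch and M = 4 the response has unit magnitude at the harmonics (100 Hz, 200 Hz, ...) and a null at 125 Hz, which is the selectivity that sharpens both the speech-to-noise ratio and the voiced-unvoiced separation.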




Subject  to  the  assumptions  that  the  envelopes  of  the  unvoiced  and 
voiced  speech  power  spectra  are  known  and  that  the  pitch  period  for  voiced 
speech  can  be  estimated  then  the  optimum  classifier  can  be  implemented  as 
shown  in  Figure  1. 

Of  course  all  of  this  information  is  not  available  a priori  and  it  will 
be  necessary  to  introduce  approximations  to  the  filtering  and  estimation 
operations  while  maintaining  the  basic  structure  of  the  estimator-correlator 
receiver.  This  will  be  the  goal  of  the  next  section. 

IV. PRACTICAL IMPLEMENTATION OF THE ESTIMATOR-CORRELATOR SPEECH CLASSIFIER
AGAINST WHITE NOISE

For voiced speech the optimum minimum mean squared error filter
has the transfer function

$$H_v(f) = \frac{G_v(f)\, C(f;T)}{G_v(f)\, C(f;T) + N_0/2} \tag{4.1}$$

which  passes  those  frequencies  at  which  the  signal  power  is  substantially 
larger  than  the  noise  power  and  rejects  all  others.  Certainly  the  comb  filter 
in  the  denominator  contributes  to  the  definition  of  those  frequencies  at 
which  noise  rejection  should  occur.  However,  in  white  noise  approximately  the 
same  rejection  performance  can  be  obtained  by  a cascade  combination  of  the  comb 
filter  and  the  least  squares  filter  designed  on  the  basis  of  simply  the  spectral 
envelope.  Therefore  the  voiced  speech  estimator  filter  is  taken  to  be 


$$H_v(f) = C(f;T) \cdot \frac{G_v(f)}{G_v(f) + N_0/2} \tag{4.2}$$


For unvoiced speech the estimator filter is

$$H_u(f) = \frac{G_u(f)}{G_u(f) + N_0/2} \tag{4.3}$$


Setting $i = 2$ for unvoiced and $i = 3$ for voiced, the Wiener filters based on the
spectral envelopes for both cases can be written as

$$H_i(z) = \sum_{k=-\infty}^{\infty} a_k^{i}\, z^{-k} \tag{4.4}$$

where the coefficients $a_k^{i}$ satisfy the Wiener-Hopf equation

$$\sum_{k=-\infty}^{\infty} a_k^{i} \left[ R_i(k-j) + \sigma^2 \delta(k-j) \right] = R_i(j), \qquad -\infty < j < \infty \tag{4.5}$$

where $\sigma^2 = (N_0/2) F_s$ represents the energy in the noise process ($F_s$ is the
sampling rate) and where $R_2(k)$, $R_3(k)$ are the sampled data correlation functions
corresponding to the power spectra $G_u(f)$, $G_v(f)$ respectively. In practice
the correlation functions can be suitably truncated and then (4.5) can be
efficiently solved using the Levinson recursion [9]. Of course the solution
requires that the correlation functions for an ensemble of unvoiced and voiced
speech sample functions be computed for a large class of utterances and a
large class of speakers. In order to bootstrap the system, initial classification
would have to be done manually, which would be extremely tedious and time
consuming. In order to avoid this problem a more practical and robust strategy
is proposed based on the well known global properties of unvoiced and voiced
speech spectra and a close examination of the filtering operation defined
in (4.2) and (4.3).
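Although the paper goes on to replace the exact Wiener solution by simpler filters, the truncated Wiener-Hopf system mentioned above is easy to sketch. The code below is a generic Levinson recursion for a symmetric Toeplitz system with an arbitrary right-hand side, applied to a two-sided truncation with lags from -p to p. The truncation scheme and the names `levinson_solve` and `wiener_taps` are assumptions of this sketch, not details taken from the text.

```python
def levinson_solve(r, b):
    """Solve T x = b where T is symmetric Toeplitz with first column r."""
    n = len(b)
    f = [1.0 / r[0]]      # forward vector: T f = first unit vector
    x = [b[0] / r[0]]
    for k in range(1, n):
        # Error in the last row when f is extended with a zero.
        eps = sum(r[k - i] * f[i] for i in range(k))
        denom = 1.0 - eps * eps
        f_ext = f + [0.0]
        b_ext = [0.0] + f[::-1]   # backward vector is the reversed forward vector
        f = [(f_ext[i] - eps * b_ext[i]) / denom for i in range(k + 1)]
        backward = f[::-1]
        # Error in the last row when the current solution is extended with a zero.
        eps_x = sum(r[k - i] * x[i] for i in range(k))
        x = [xi + (b[k] - eps_x) * bi for xi, bi in zip(x + [0.0], backward)]
    return x


def wiener_taps(signal_corr, sigma2, p):
    """Taps a_{-p}..a_{p} of the truncated two-sided Wiener filter.

    signal_corr : R(0), R(1), ..., R(2p) of the speech process
    sigma2      : white noise variance
    """
    size = 2 * p + 1
    r = [signal_corr[m] + (sigma2 if m == 0 else 0.0) for m in range(size)]
    rhs = [signal_corr[abs(j - p)] for j in range(size)]
    return levinson_solve(r, rhs)
```

As a sanity check, a white signal of power 2 in noise of variance 2 gives a single centre tap of 0.5, which is the expected gain of signal power over signal-plus-noise power.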

The  essence  of  the  Wiener  filter  is  to  pass  those  frequencies  at  which 
the speech power is substantially larger than the noise power. As a good
first  approximation  it  seems  reasonable  to  approximate  the  Wiener  filter  by  a 
passband  filter  that  passes  "most"  of  the  energy  in  an  unvoiced  or  voiced  speech 
sound.  For  unvoiced  speech  it  can  be  assumed  that  "most"  of  the  energy  will  be 
above  1000  Hz  while  for  voiced  speech  "most"  of  the  energy  will  be  below 
2000  Hz.  While  restricting  the  estimator  filters  to  these  frequencies  improves 
the  detection  SNR  of  unvoiced  and  voiced  speech,  of  at  least  equal  importance 

is  the  ability  of  the  unvoiced  filter  to  reject  voiced  speech  and  vice  versa. 
Since  the  first  formant  of  voiced  speech  is  approximately  1000  Hz  then  if  the 
cutoff  of  the  unvoiced  speech  filter  is  above  1250  Hz  then  most  of  the  unvoiced 
speech  energy  will  pass  through  the  filter  while  a large  fraction  of  a voiced 
speech  signal  will  be  attenuated.  Similarly  if  the  cutoff  of  the  voiced  speech 
signal  is  above  2000  Hz  then  most  of  Its  energy  will  pass  through  the  voiced 
filter  while  a substantial  fraction  of  an  unvoiced  speech  signal  will  be 
attenuated.  From  this  point  of  view  it  can  be  seen  that  it  is  crucial  that 
the  input  data  to  the  classifier  not  be  preemphasized  since  the  higher  formants 
of  a voiced  speech  signal  would  take  on  the  attributes  of  an  unvoiced  speech 
waveform  at  the  expense  of  good  classifier  performance.  Therefore  if 
pre-emphasis  is  to  be  used  for  speech  analysis  and  synthesis  the  data  will 
have to undergo digital de-emphasis prior to speech classification.
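A minimal sketch of the de-emphasis step follows, assuming the analysis chain uses a first-order pre-emphasis of the form s(n) - a s(n-1). The coefficient value 0.9375 is a common choice in LPC work and is assumed here; it is not given in the text, and both function names are hypothetical.

```python
def preemphasize(samples, a=0.9375):
    """First-order pre-emphasis: out(n) = s(n) - a * s(n-1)."""
    out, previous = [], 0.0
    for s in samples:
        out.append(s - a * previous)
        previous = s
    return out


def deemphasize(samples, a=0.9375):
    """Inverse filter: out(n) = x(n) + a * out(n-1)."""
    out, previous = [], 0.0
    for x in samples:
        previous = x + a * previous
        out.append(previous)
    return out
```

Applying `deemphasize` to pre-emphasized data restores the original spectral tilt, so the classifier sees voiced speech with its energy concentrated at low frequencies again.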


Fig. 2. Practical realization of the optimum speech classifier.


On the basis of the preceding arguments the Wiener filter for unvoiced
speech will be approximated by a high pass linear phase digital filter whose
cutoff frequency is below 1250 Hz. For voiced speech a low pass linear phase
digital filter having a cutoff frequency above 2000 Hz will be used. The linear
phase requirement is essential since the temporal properties of the waveforms
must be preserved in order that a meaningful correlation operation be obtained.
The practical implementation of the optimum classifier against white noise is
shown in Figure 2. The detailed characteristics of the linear phase filters are
provided in the Appendix.
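The two-channel structure can be illustrated with simple symmetric FIR filters. The Appendix designs are not reproduced here, so this sketch substitutes a Hamming-windowed sinc design; the tap count, the 8 kHz sampling rate used in the usage note, and all function names are assumptions. The filters are applied as centred convolutions so that the linear phase delay is removed before correlating with the input.

```python
import math


def lowpass_taps(cutoff_hz, fs_hz, num_taps=51):
    """Symmetric (linear phase) windowed-sinc low pass filter, odd length."""
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs_hz
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]   # unit gain at zero frequency


def highpass_taps(cutoff_hz, fs_hz, num_taps=51):
    """High pass by spectral inversion of the low pass design."""
    taps = [-t for t in lowpass_taps(cutoff_hz, fs_hz, num_taps)]
    taps[(num_taps - 1) // 2] += 1.0
    return taps


def centred_filter(y, taps):
    """Convolve with the delay removed so the output lines up with the input."""
    mid = (len(taps) - 1) // 2
    out = []
    for n in range(len(y)):
        acc = 0.0
        for k, t in enumerate(taps):
            idx = n + mid - k
            if 0 <= idx < len(y):
                acc += t * y[idx]
        out.append(acc)
    return out


def channel_statistic(y, taps):
    """Correlate the raw frame with its filtered estimate."""
    return sum(a * b for a, b in zip(y, centred_filter(y, taps)))
```

With `lowpass_taps(2000.0, 8000.0)` in the voiced channel and `highpass_taps(1250.0, 8000.0)` in the unvoiced channel, a 300 Hz tone produces a much larger voiced statistic, while a 3000 Hz tone favours the unvoiced channel.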

Implicit in the realization illustrated in Figure 2 is the estimation
of the pitch period of a voiced waveform so that the additional discrimination
inherent in the comb filter can be exploited. A further simplification in
processor complexity can be obtained simply by omitting the comb filter and
relying on the spectral orthogonality of the two speech types. However, since
the periodicity of the voiced speech process is a potentially powerful
classification discriminant, for theoretical completeness, it is worthwhile
to develop a practical algorithm to exploit it. Since this necessitates
an estimate of the pitch period, a brief exposition of an optimum pitch
estimation algorithm will be presented.

V.  OPTIMUM  PITCH  ESTIMATION 

Voiced speech was modelled as a periodic random process in the sense
that $R_v(k) = R_v(k+T)$ for some pitch period $T$. This means that almost every
sample function in the ensemble is periodic with period $T$. Therefore
the voiced speech signal, $v(n)$, can be modelled as

$$v(n) = q(k)\big|_{k = n \bmod T} \tag{5.1}$$

where $q(1), q(2), \ldots, q(T)$ are completely unknown. Of course to be faithful
to the random process formulation of voiced speech, the quantities $q(k)$
should be treated as correlated random variables. However to keep the
estimation problem mathematically tractable the correlation properties will be
ignored at first. The voiced speech data are therefore taken to be

$$y(n) = v(n) + w(n) \tag{5.2}$$

where $w(n)$ represents white Gaussian noise and $v(n)$ is given by (5.1). Based
on $N$ samples of this data the parameters $q(1), q(2), \ldots, q(T)$ and $T$ are to be
estimated.

The above pitch estimation problem was formulated
and solved by Wise, Caprio and Parks. Using the maximum likelihood estimation
rule they minimized the cost function

$$
\begin{aligned}
D(\mathbf{q}, T) &= \sum_{n=1}^{N} [y(n) - v(n)]^2 \\
&= \sum_{n=1}^{N} y^2(n) - 2 \sum_{k=1}^{T} \sum_{m=0}^{M-1} y(k+mT)\, v(k+mT) + \sum_{k=1}^{T} \sum_{m=0}^{M-1} v^2(k+mT)
\end{aligned}
\tag{5.3}
$$

In order to simplify the derivation, it has been assumed that $N = MT$, $M$ an
integer*. From the periodicity condition $v(k+mT) = q(k)$, (5.3)
reduces to

$$D(\mathbf{q}, T) = \sum_{n=1}^{N} y^2(n) - 2 \sum_{k=1}^{T} q(k) \sum_{m=0}^{M-1} y(k+mT) + M \sum_{k=1}^{T} q^2(k) \tag{5.4}$$

*The more general case is tedious and contributes little to the final result.


Since the basic voiced speech waveform $q(1), \ldots, q(T)$ has been assumed
completely unknown (i.e., the correlation properties have been ignored*)
then, for fixed $T$, the minimizing values are obviously

$$\hat{q}(k) = \frac{1}{M} \sum_{m=0}^{M-1} y(k+mT) \tag{5.5}$$

The estimate of the voiced speech waveform is therefore

$$\hat{v}(n|N) = \hat{q}(k)\big|_{k = n \bmod T} \tag{5.6}$$

where the notation $\hat{v}(n|N)$ is used to denote the fact that all $N$ measurements
$y(1), y(2), \ldots, y(N)$ are used in developing the estimate of the voiced speech
waveform $v(n)$, $n \leq N$. In that sense, the estimator is unrealizable. The
corresponding minimum value of the likelihood function is

$$D(T) = \sum_{n=1}^{N} [y(n) - \hat{v}(n|N)]^2 \tag{5.7a}$$

$$\phantom{D(T)} = \sum_{n=1}^{N} y^2(n) - \sum_{n=1}^{N} \hat{v}^2(n|N) \tag{5.7b}$$


Since $\hat{v}(n|N)$ can be interpreted as the output of a comb filter tuned to
pitch period $T$ when $y(n)$ is the input, then the second term in (5.7b) simply
represents the energy at the output of this comb filter. Therefore the
optimum estimate of the pitch period can be obtained by constructing a
bank of comb filters each tuned to a slightly different pitch period and
choosing as the estimate the pitch corresponding to the comb filter for
which the output energy is largest.

*The more general case is treated by McAulay.

A realizable estimator that uses only the data up to time $n$ is
$\hat{v}(n|n) = \frac{1}{M} \sum_{m=0}^{M-1} y(n - mT)$.
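The bank-of-comb-filters search described above can be sketched as follows. For each candidate period the frame is folded into whole periods, averaged, and the energy of the averaged waveform is measured. Since candidates that do not divide the frame length use slightly different numbers of samples, this sketch compares energy per sample, which is a choice made here rather than something stated in the text. The synthetic test waveform and all function names are hypothetical.

```python
import math


def comb_output_energy(y, period):
    """Per-sample energy of the comb filter output for one candidate period."""
    m = len(y) // period              # number of whole periods available
    if m < 1:
        return 0.0
    energy = 0.0
    for k in range(period):
        # Average the samples that share the same phase within the period.
        q_hat = sum(y[k + i * period] for i in range(m)) / m
        energy += m * q_hat * q_hat   # q_hat is repeated m times in the estimate
    return energy / (m * period)


def estimate_pitch_period(y, candidates):
    """Choose the candidate whose comb filter output energy is largest."""
    return max(candidates, key=lambda t: comb_output_energy(y, t))


def synthetic_voiced(period, n_samples):
    """Exactly periodic decaying oscillation, used only for illustration."""
    q = [math.exp(-k / 8.0) * math.cos(2.0 * math.pi * k / 7.0)
         for k in range(period)]
    return [q[n % period] for n in range(n_samples)]
```

For a frame of 240 samples with true period 40, searching periods 20 to 70 recovers 40. Including a candidate of 80 would produce a tie, since a waveform with period 40 is also periodic with period 80.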

It  is  important  to  keep  in  mind  the  fact  that  voiced  speech  signals 
are at best quasi-periodic; hence, there is a definite limitation on the
number  of  periods  over  which  the  averaging  process  is  a meaningful  operation. 
Since  values  of  the  pitch  frequency  generally  fall  within  the  range  70-300  Hz 
corresponding  to  pitch  periods  3-15  ms  long,  and  since  the  time  required  for 
a significant  alteration  in  the  vocal  tract  is  approximately  20  ms,  there 
can  be  1-7  repetitions  of  the  voiced  speech  waveform.  Therefore  the  number 
of  periods  over  which  the  data  is  averaged  is  a design  parameter  that  must 
be  chosen  to  carefully  trade  off  the  estimation  accuracy  and  the  quasi- 
periodic  nature  of  the  voiced  speech  waveform. 

A particularly important practical case corresponds to the assumption
that the voiced speech waveform is periodic for two successive periods.
In this case from (5.5) and (5.6) the maximum likelihood estimate of the voiced
speech signal is

$$\hat{v}(n|N) = \tfrac{1}{2} [y(n) + y(n-T)] \tag{5.8}$$

which from (5.7a) results in the residual error

$$D(T) = \sum_{n=1}^{N} [y(n) - \hat{v}(n|N)]^2 = \frac{1}{4} \sum_{n=1}^{N} [y(n) - y(n-T)]^2 \tag{5.9}$$
The estimate of the pitch period is then the value of $T$ that minimizes $D(T)$.
This criterion has already been proposed for pitch estimation by Moorer [12]
and Ross et al., except that the squared difference has been approximated
by the absolute magnitude difference function in order to achieve greater
dynamic range and computational speed. Experimental results have shown that the
quality of the pitch estimates is roughly equivalent to that of the cepstral
method and successful operation has also been demonstrated in strong noise
environments. For this reason it is conjectured that (5.5)-(5.7) represent
a possible solution to the problem of robust pitch estimation. To see this
suppose that the true pitch period is $T_0$. Then the observed data is


$$y(n) = v(n;T_0) + w(n) \tag{5.10}$$

where $v(n;T_0) = q(k)\big|_{k = n \bmod T_0}$. The output of the comb filter tuned to pitch
period $T$ is

$$\hat{v}(n;T) = \frac{1}{M} \sum_{m=0}^{M-1} v(n-mT;T_0) + \frac{1}{M} \sum_{m=0}^{M-1} w(n-mT) \tag{5.11}$$

The noise signal at the output of the comb filter is

$$\eta(n;T) = \frac{1}{M} \sum_{m=0}^{M-1} w(n-mT) \tag{5.12}$$


As long as the correlation time of the noise process is less than the minimum
pitch period of interest, then if $w(n)$ has variance $\sigma^2$, $\eta(n;T)$
will have variance $\sigma^2/M$. For the comb filter tuned to pitch $T_0$ the output
signal is

$$\hat{v}(n;T_0) = q(k)\big|_{k = n \bmod T_0} + \eta(n;T_0) \tag{5.13}$$

Fig. 3. Practical implementation of the optimum pitch estimator.

Therefore there is an M:1 increase in signal-to-noise ratio as a result of
using the comb filter. Applied to the two-pulse canceller in (5.10)
(i.e., the AMDF) a 3 dB improvement in SNR is obtained for the class of noise
processes whose correlation times are less than the minimum pitch period
of interest.
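The condition on the noise correlation time can be made concrete by computing the variance of the comb filter output noise directly from the noise autocorrelation. The sketch below assumes a stationary noise process described by a function returning its autocorrelation at a given lag; the function name and the example autocorrelations are hypothetical.

```python
def comb_noise_variance(noise_autocorr, period, m_periods):
    """Variance of (1/M) * sum over m of w(n - m*period).

    noise_autocorr : function mapping a non-negative lag to R_w(lag)
    """
    total = 0.0
    for m1 in range(m_periods):
        for m2 in range(m_periods):
            total += noise_autocorr(abs(m1 - m2) * period)
    return total / (m_periods ** 2)
```

If the autocorrelation has died out before one pitch period, only the M diagonal terms survive and the result is exactly the input variance divided by M. If the noise remains correlated over several pitch periods, the off-diagonal terms add to the total and the improvement is smaller.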

Although  originally  proposed  as  a pitch  estimation  criterion  based 
on  ad  hoc  considerations,  the  maximum  likelihood  theory  shows  that  the  average 
squared difference function is optimum and robust when the voiced speech
waveform is modelled as a deterministic quasi-periodic waveform with periodicity
extending over two periods. The major limitation in using the two-pulse
comb filter (i.e., the AMDF) is the not infrequent occurrence of pitch doubling
which occurs when the voiced speech is periodic for at least four pitch periods.
At the expense of increasing the length of the speech buffer, an M-pulse
comb filter, $M \geq 3$, can be used to reduce the rate at which pitch doubling
errors occur.
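The two-pulse criterion and its magnitude approximation can be sketched as below. Each function averages over the samples for which a delayed sample exists, a normalization chosen here so that different candidate periods are comparable; the function names are hypothetical.

```python
def squared_difference(y, period):
    """Average of (1/4) * [y(n) - y(n - period)]^2 over the usable samples."""
    terms = [(y[n] - y[n - period]) ** 2 for n in range(period, len(y))]
    return 0.25 * sum(terms) / len(terms)


def magnitude_difference(y, period):
    """Average magnitude difference function (AMDF) for one candidate period."""
    terms = [abs(y[n] - y[n - period]) for n in range(period, len(y))]
    return sum(terms) / len(terms)


def estimate_period(y, candidates, criterion=squared_difference):
    """Choose the candidate period that minimizes the chosen criterion."""
    return min(candidates, key=lambda t: criterion(y, t))
```

On an exactly periodic frame both criteria are zero at the true period and also at twice the true period, which illustrates the ambiguity behind pitch doubling when the candidate range includes both values.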

A further  enhancement  in  the  pitch  estimate  can  be  obtained  by  using 
the  low  pass  voiced  speech  filter  to  increase  the  pitch  estimator  SNR.  This 
corresponds  to  exploitation  of  the  global  correlation  properties  of  voiced 
speech.  The  approximate  matched  filter  configuration  of  the  pitch  detector 
is  shown  in  Figure  3. 

VI. THE OPTIMUM CLASSIFIER AGAINST COLOURED NOISE

There are several examples in which speech in non-white acoustic
background noise can be effectively classified using the algorithm that was
defined to be optimum against white noise. In particular whenever the signal-
to-noise  ratio  is  high,  the  white  noise  classifier  will  yield  acceptable 
performance.  There  are  some  cases,  particularly  if  the  SNR  is  low  and  the 
noise  is  highly  correlated,  where  significant  improvements  can  be  achieved 
by  taking  the  spectral  characteristics  of  the  noise  into  account.  In  this 
section  the  structure  of  the  optimum  classifier  will  be  derived  for  the  coloured 
noise  case  and  then  reasonable  practical  approximations  will  be  deduced  in 
order  to  simplify  the  complexity  of  the  signal  processor. 

For this classification problem the data corresponding to hypothesis $H_i$ is

$$y(n) = s_i(n) + w_c(n) + w(n), \qquad i = 1,2,3 \tag{6.1}$$

where $w_c(n)$ denotes the coloured noise present on all three hypotheses.

Note that a white noise component, $w(n)$, is also incorporated into the model to avoid mathematical problems relating to singular solutions. The standard approach to this problem is to precede all of the processing by a whitening filter and then apply the white noise solution. This was the approach taken by McAulay. Although mathematically correct, this approach encounters practical difficulties because the whitening filter essentially preemphasizes the speech data. As has already been discussed, this can cause the higher formants of voiced speech to acquire the same attributes as unvoiced speech, which makes classification difficult. McAulay and Yates [14] have derived an estimator-correlator classifier that does not require a whitening prefilter. Drawing on their results and those developed in Section III, two sufficient statistics are computed. They are

$$\ell_i = \sum_{n=1}^{N} z(n)\,\hat{s}_i(n), \qquad i = 2,3 \tag{6.2}$$

where

$$\hat{s}_i(n) = \sum_{k=-\infty}^{\infty} h_i(n-k)\,y(k), \qquad i = 2,3 \tag{6.3}$$

is the linear least squared error unrealizable estimate of $s_i(n)$ based on the data $y(n) = s_i(n) + w_c(n) + w(n)$ and where

$$z(n) = \sum_{k=-\infty}^{\infty} h_c(n-k)\,y(k) \tag{6.4}$$

is the result of passing $y(n)$ through the clutter rejection filter $h_c(n)$.

It has been implicitly assumed that the speech and noise processes are independent and quasi-stationary [14]. The transfer functions of the filters are

$$H_i(f) = \frac{G_i(f)}{G_i(f) + G_c(f) + N_0/2}, \qquad i = 2,3 \tag{6.5}$$

$$H_c(f) = 1 - \frac{G_c(f)}{G_c(f) + N_0/2} = \frac{N_0/2}{G_c(f) + N_0/2} \tag{6.6}$$


where $G_c(f)$, $G_2(f)$, $G_3(f)$ represent the power spectra for the coloured noise, unvoiced speech and voiced speech processes respectively. The second term in (6.6) is precisely the linear least squares unrealizable estimator of $w_c(n)$ based on the signal $w_c(n) + w(n)$. Therefore the clutter filter attempts to remove the coloured noise from the data before performing the correlation operation. The optimum classifier structure is shown in Figure 4. The classification rule is similar to that derived for white noise, equation (3.7), except that the sufficient statistics are now those of (6.2), formed with the coloured noise filters, instead of their white noise counterparts.

Fig. 4. The optimum speech classifier against coloured noise.
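As a rough numerical illustration of (6.2), (6.5) and (6.6), the sketch below evaluates the two filter responses bin by bin and forms the correlation statistic. The spectra are toy values chosen for the example and the function names are hypothetical; nothing here reproduces the report's actual processing.

```python
def estimator_filter(g_speech, g_clutter, n0_half):
    """H_i(f) of (6.5), evaluated bin by bin."""
    return [gs / (gs + gc + n0_half) for gs, gc in zip(g_speech, g_clutter)]


def clutter_filter(g_clutter, n0_half):
    """H_c(f) of (6.6), evaluated bin by bin."""
    return [n0_half / (gc + n0_half) for gc in g_clutter]


def correlate(z, s_hat):
    """Sufficient statistic of (6.2): sum over the frame of z(n) * s_hat(n)."""
    return sum(a * b for a, b in zip(z, s_hat))


if __name__ == "__main__":
    g_c = [0.0, 1.0, 9.0, 1.0]   # toy clutter spectrum, strongest in bin 2
    g_2 = [0.5, 0.5, 2.0, 4.0]   # toy unvoiced speech spectrum
    print(clutter_filter(g_c, n0_half=1.0))        # smallest where clutter is strong
    print(estimator_filter(g_2, g_c, n0_half=1.0))
```

Both responses lie between 0 and 1, and the clutter filter equals one minus the Wiener estimator of the coloured noise, which is the interpretation given in the text.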

VII. PRACTICAL IMPLEMENTATION OF THE ESTIMATOR-CORRELATOR SPEECH CLASSIFIER AGAINST COLOURED NOISE

The arguments for simplifying the processing of voiced and unvoiced speech proceed along the same lines as those made for the white noise case. In particular, if knowledge of pitch is available the spectral harmonics of voiced speech are matched by using a comb filter in cascade with the Wiener filter designed on the basis of the spectral envelope. Therefore the voiced speech estimator filter is

$$H_v(f) = C(f;T)\,\frac{G_v(f)}{G_v(f) + G_c(f) + N_0/2} \tag{7.1}$$

where $C(f;T)$ is the comb filter tuned to the most recent pitch estimate, $T$. For unvoiced speech the estimator filter is*

$$H_u(f) = \frac{G_u(f)}{G_u(f) + G_c(f)} \tag{7.2}$$

*The effects of the artificial white noise term have been neglected at this point since there is no problem with singular solutions.

Lacking knowledge of the exact form of $G_u(f)$ and $G_v(f)$ a good first approximation is to use the linear phase low pass (cutoff above 2000 Hz) and high pass (cutoff below 1250 Hz) filters in the voiced and unvoiced speech channels as was done in the white noise case. This ensures the spectral orthogonality of the two speech channels and enhances the speech-to-noise ratio whenever the noise spectrum lies outside the filter passbands. For coloured noise, however, it is possible that all of the noise energy will lie within the filter passbands in which case no speech enhancement will occur if only the fixed filters are used. Somehow additional processing tuned to reject the clutter will have to precede the fixed filters in the speech channels. To develop a clue as to the form of the clutter processor it is necessary to reexamine (7.1) and (7.2). Letting $G_2(f) = G_u(f)$ and $G_3(f) = G_v(f)$ then the unvoiced and voiced speech Wiener filters can be written as

$$H_i(f) = \frac{G_i(f)}{G_i(f) + G_c(f)}, \qquad i = 2,3 \tag{7.3}$$

Realization of these filters requires that the speech and noise spectra be known. Since the noise statistics can be measured during the silent intervals it is reasonable to assume that the clutter spectrum is known. Unfortunately a priori estimates of the speech spectra are not available unless long term averages are determined from training sets. When detailed knowledge of the frequency distribution of the speech is unavailable a conservative approach is to model the speech as white noise thereby having a flat spectrum. Letting

$$G_i(f) = \alpha_i, \qquad i = 2,3 \tag{7.4}$$

and substituting this into (7.3) results in the filters

$$H_i(f) = \frac{\alpha_i}{G_c(f) + \alpha_i}, \qquad i = 2,3 \tag{7.5}$$

Since $H_i(f) \approx 0$ whenever $G_c(f) \gg \alpha_i$ and $H_i(f) \approx 1$ whenever $G_c(f) \ll \alpha_i$, (7.5) can be interpreted as a notch filter tuned to reject "most" of the clutter energy. When the speech-to-noise ratio (SNR) is large little clutter rejection is needed and $\alpha_i$ should be large since this results in a passband filter. When the SNR is small, then the clutter must be rejected whatever the cost in speech distortion, which necessitates a small value for $\alpha_i$. It follows, therefore, that the parameter $\alpha_i$ should be proportional to the speech-to-noise ratio. Since the clutter power is known from the silent intervals, estimates of the SNR can be made from the data frame being analyzed. In this mode the distinction between voiced and unvoiced speech disappears and only a single parameter value and clutter filter need be determined. In this sense the clutter filter represents an adaptive prefilter whose output, in a conservative sense, represents the best available estimate of the speech waveform.
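The notch interpretation of (7.5) is easy to see numerically. The sketch below, with a made-up clutter spectrum and hypothetical function names, prints the filter gain in dB for three values of the tuning parameter; a small value gives a deep notch at the clutter line and a large value gives a nearly flat response.

```python
import math


def notch_response(g_clutter, alpha):
    """H(f) = alpha / (G_c(f) + alpha) of (7.5), one value per frequency bin."""
    return [alpha / (gc + alpha) for gc in g_clutter]


def to_db(h):
    """Amplitude response in dB."""
    return [20.0 * math.log10(v) for v in h]


if __name__ == "__main__":
    g_c = [0.01, 0.01, 9.0, 0.01]      # toy clutter spectrum with one strong line
    for alpha in (0.1, 1.0, 10.0):     # small alpha -> deep notch
        print(alpha, [round(d, 1) for d in to_db(notch_response(g_c, alpha))])
```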

The  results  of  this  discussion  are  summarized  in  Figure  5 which 
shows  the  practical  realization  of  the  optimum  classifier  operating  against 
a coloured  noise  background.  Except  for  the  clutter  filters  in  the 
reference  and  speech  channels  the  processing  is  identical  to  that  used  in  the 
white noise case. Since selection of the tuning parameters $\alpha_c$ and $\alpha_s$ depends
on  the  noise  statistics  further  discussion  regarding  their  selection  will 
be  deferred  to  the  section  on  experimental  results. 

The only problem that remains to be discussed is the calculation of the clutter filter impulse response from (7.5). The most straightforward approach is to solve the Wiener-Hopf equation

$$\sum_{k=-\infty}^{\infty} a_k\,[R_c(k-j) + \alpha\,\delta(k-j)] = \alpha\,\delta(j), \qquad -\infty < j < \infty \tag{7.6}$$

If the impulse response is truncated at $2p+1$ coefficients, these can be found by solving (7.6) numerically using the Levinson recursion.
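A truncated version of (7.6) is a symmetric Toeplitz system, so it can be solved with a Levinson-type recursion. The sketch below is one standard form of that recursion written in plain Python; the function names and the choice to index the $2p+1$ taps from $-p$ to $p$ are assumptions made for illustration.

```python
def levinson_solve(r, b):
    """Solve T x = b, where T is symmetric Toeplitz with first column r."""
    a = [1.0]            # prediction-error filter of the current order
    err = r[0]           # its error power
    x = [b[0] / r[0]]
    for k in range(1, len(b)):
        # Order update of the prediction-error filter.
        acc = sum(a[i] * r[k - i] for i in range(k))
        refl = -acc / err
        a_ext = a + [0.0]
        a = [a_ext[i] + refl * a_ext[k - i] for i in range(k + 1)]
        err *= 1.0 - refl * refl
        # Extend the solution by one unknown using the reversed filter.
        delta = sum(r[k - i] * x[i] for i in range(k))
        gain = (b[k] - delta) / err
        x_ext = x + [0.0]
        x = [x_ext[i] + gain * a[k - i] for i in range(k + 1)]
    return x


def clutter_filter_taps(r_c, alpha, p):
    """Taps a_{-p}..a_{p} of (7.6) truncated to 2p+1 coefficients.

    r_c holds the clutter correlation R_c(0), ..., R_c(2p).
    """
    r = list(r_c[: 2 * p + 1])
    r[0] += alpha                       # R_c(0) + alpha on the diagonal
    b = [0.0] * (2 * p + 1)
    b[p] = alpha                        # alpha * delta(j), centred at j = 0
    return levinson_solve(r, b)


if __name__ == "__main__":
    print(clutter_filter_taps([1.0, 0.5, 0.25, 0.125, 0.0625], alpha=0.5, p=2))
```

For white clutter the solution collapses to a single centre tap equal to alpha / (R_c(0) + alpha), which is a convenient sanity check.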

Another approach is to fit an all pole spectrum to $G_c(f) + \alpha$ using Linear Prediction techniques and use the spectral coefficients to determine the clutter filter. For this method the LPC spectral estimate of $G_c(f) + \alpha$ can be obtained by solving

$$\sum_{k=1}^{p} a_k\,[R_c(k-j) + \alpha\,\delta(k-j)] = R_c(j), \qquad 1 \le j \le p \tag{7.7}$$

This equation can be solved efficiently using the Levinson Recursion and results in a p-pole fit to the clutter spectrum. The estimated spectrum is

$$\hat{G}_c(z) + \alpha = \frac{\sigma_e^2}{A(z)A^*(z)} \tag{7.8}$$

where $\sigma_e^2$ is the gain of the all-pole fit and

$$A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} \tag{7.9}$$

which corresponds to the Inverse Filter in the usual LPC analysis. Substituting (7.8) into (7.5) results in the Wiener filter

$$H(z) = \frac{\alpha}{\sigma_e^2}\,A(z)A^*(z) \tag{7.10}$$

Letting $y(n)$ denote the input sequence and $\hat{s}(n)$ the output sequence then

$$\hat{S}(z) = \frac{\alpha}{\sigma_e^2}\,A(z)A^*(z)\,Y(z) = \frac{\alpha}{\sigma_e^2}\,A(z)\,X(z) \tag{7.11}$$

where

$$X(z) = A^*(z)\,Y(z) \tag{7.12}$$

Since the LPC coefficients $\{a_k\}$ are real

$$A^*(z) = 1 - \sum_{k=1}^{p} a_k z^{k} \tag{7.13}$$

and

$$x(n) = y(n) - \sum_{k=1}^{p} a_k\,y(n+k) \tag{7.14}$$

$$\hat{s}(n) = \frac{\alpha}{\sigma_e^2}\Big[x(n) - \sum_{k=1}^{p} a_k\,x(n-k)\Big] \tag{7.15}$$


Therefore  the  unrealizable  Wiener  filter  can  be  implemented  by  the  cascade 
combination  of  an  inverse  filter  that  operates  on  p samples  of  future  data 
and  an  inverse  filter  that  operates  on  p samples  of  past  data.  Therefore 
a p-sample  buffer  must  be  available  to  provide  for  the  future  data.  The 
advantage  of  this  approach  is  that  the  length  of  the  impulse  response  is 
completely  determined  on  the  basis  of  the  number  of  poles  required  to  fit 
the  clutter  spectrum. 
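The cascade of a forward-looking and a backward-looking inverse filter can be sketched as below. This is an illustration under stated assumptions: the gain is taken as alpha divided by the prediction error power of the all-pole fit, samples outside the frame are treated as zero (the report instead keeps a p-sample buffer of future data), and all function names are hypothetical.

```python
def levinson_durbin(r, order):
    """Solve sum_k a_k r(|k-j|) = r(j), 1 <= j <= order.

    Returns the predictor coefficients [a_1, ..., a_p] and the error power.
    """
    a = []
    err = r[0]
    for k in range(1, order + 1):
        acc = r[k] - sum(a[i] * r[k - 1 - i] for i in range(k - 1))
        refl = acc / err
        a = [a[i] - refl * a[k - 2 - i] for i in range(k - 1)] + [refl]
        err *= 1.0 - refl * refl
    return a, err


def design_clutter_filter(r_c, alpha, order):
    """LPC fit to G_c + alpha: returns (a, gain), gain = alpha / error power."""
    r = list(r_c[: order + 1])
    r[0] += alpha                     # add alpha * delta(k - j) to the diagonal
    a, err = levinson_durbin(r, order)
    return a, alpha / err


def wiener_cascade(y, a, gain):
    """Apply gain * A(z) A*(z): future-data stage first, then past-data stage."""
    p = len(a)

    def at(seq, n):                   # zero outside the frame (an assumption)
        return seq[n] if 0 <= n < len(seq) else 0.0

    x = [y[n] - sum(a[k - 1] * at(y, n + k) for k in range(1, p + 1))
         for n in range(len(y))]
    return [gain * (x[n] - sum(a[k - 1] * at(x, n - k) for k in range(1, p + 1)))
            for n in range(len(y))]


if __name__ == "__main__":
    a, gain = design_clutter_filter([0.5, 0.5, 0.25], alpha=0.5, order=2)
    print(a, gain)
    print(wiener_cascade([0.0, 0.0, 1.0, 0.0, 0.0], [0.5], 1.0))
```

The impulse response printed last is symmetric about the input sample, which reflects the zero-phase (unrealizable) character of the filter.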

VIII. EXPERIMENTAL RESULTS

The  signal  processing  concepts  developed  in  the  previous  sections 
were  evaluated  experimentally  using  speech  data  that  was  corrupted  by 
Airborne Command Post (ACP) noise. Not only does this provide a good
pedagogical  tool  for  illustrating  the  filtering  ideas  but  it  represents 
an  important  real-world  speech  encoding  environment  which  is  not  adequately 
solved  using  state-of-the-art  vocoder  technology. 
The noisy speech data was sampled every 132 μsec (7575 Hz)
and 158 samples were collected to define a 20 millisec frame. Figure 6a
illustrates  a 20  millisec  sample  function  of  ACP  noise.  Figure  6f 
is  a plot  of  the  magnitude  of  its  Fourier  Transform  measured  in  dB.  The 
correlation  function  of  the  mth  frame  (i.e.,  the  current  frame)  of 
noise  data  was  computed  from 

$$R_y(k;m) = \sum_{n=0}^{N-1-k} x(n)\,x(n+k), \qquad k = 0,1,\ldots,p;\quad m = 1,2,\ldots \tag{8.1}$$

where $x(n)$ is the Hamming weighted version of the input data $y(n)$. A first order smoothed correlation function was then computed from

$$R_c(k;m) = (1-\gamma)\,R_y(k;m) + \gamma\,R_c(k;m-1) \tag{8.2}$$

In general the weighting constant $\gamma$ should be chosen to reflect the quasi-stationarity of the noise random process. For ACP noise $\gamma = .95$ was chosen arbitrarily and seemed to produce good results.
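The frame correlation and its frame-to-frame smoothing might be sketched as follows. Two assumptions are made here: the Hamming weights use the common 0.54/0.46 definition, and the smoother is written in unit-gain form, $(1-\gamma)$ times the new frame plus $\gamma$ times the running average. The function names are illustrative only.

```python
import math


def hamming(n_samples):
    """Hamming weights (0.54 - 0.46 cos form) for a frame of n_samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def frame_autocorrelation(y, order):
    """R_y(k; m) for k = 0..order, computed on the Hamming weighted frame."""
    x = [w * v for w, v in zip(hamming(len(y)), y)]
    n_total = len(x)
    return [sum(x[n] * x[n + k] for n in range(n_total - k))
            for k in range(order + 1)]


def smooth(previous, current, gamma=0.95):
    """First order recursive average of correlation lags across frames."""
    return [(1.0 - gamma) * c + gamma * p for p, c in zip(previous, current)]


if __name__ == "__main__":
    frame = [math.sin(0.3 * n) for n in range(158)]
    r_y = frame_autocorrelation(frame, order=4)
    r_c = smooth(r_y, r_y)       # a constant input leaves the average unchanged
    print(r_y[0], r_c[0])
```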

Fig. 6. Processor response due to noise. Recoverable panel titles: (c) unvoiced speech channel signal; (d) voiced speech channel signal; (e) prefilter output signal.

From (6.6) the clutter filter in the reference channel was given by

$$H_c(z;m) = \frac{\alpha_c(m)}{G_c(z) + \alpha_c(m)} \tag{8.3}$$

The impulse response was found using Linear Prediction techniques as described in the previous section. This necessitates solving the Wiener-Hopf prediction equation

$$\sum_{k=1}^{p} a_k\,[R_c(k-j;m) + \alpha_c(m)\,\delta(k-j)] = R_c(j;m), \qquad 1 \le j \le p \tag{8.4}$$


using the long term averaged correlation function computed at the last frame (i.e., the mth frame). A whole class of clutter filters can be obtained simply by varying the parameter $\alpha_c(m)$. Typical transfer functions from this class are shown in Figure 6g for three values of $\alpha_c(m)$. It was found that the clutter filter defined for the value $\alpha_c(m) = R_c(0;m)$ worked well for ACP noise. For other noise types other values would probably be more appropriate. A little experimentation is therefore required to tune the clutter filter to different noise processes.

The unvoiced and voiced speech channels are preceded by another clutter rejection filter given by (7.5), namely

$$H_s(z;m) = \frac{\alpha_s(m)}{G_c(z) + \alpha_s(m)} \tag{8.5}$$

where $\alpha_s(m)$ is chosen to be proportional to the speech-to-noise ratio measured for the current frame of data (i.e., the mth frame). Since $R_y(0;m)$ represents a measure of the speech plus noise energy for the current frame of data and since $R_c(0;m)$ represents a measure of the long term averaged noise energy, then a reasonable estimate for the speech-to-noise energy is

$$\hat{\alpha}_s(m) = R_y(0;m) - R_c(0;m) \tag{8.6}$$

It is possible that the energy in any one 20 millisec sample function will be less than the average clutter energy, especially if that sample function contains noise alone or noise plus unvoiced speech. Therefore provision must be made to bound the clutter notch parameter $\alpha_s(m)$ away from zero. A reasonable scheme is to pick

$$\alpha_s(m) = \max\,[\hat{\alpha}_s(m),\ \alpha_c(m)] \tag{8.7}$$

which guarantees that the speech-clutter-filter notch will never be deeper than that in the reference channel. As before the impulse response was found using the Linear Prediction power spectrum which was obtained by solving the Wiener-Hopf predictor equation (8.4) using $\alpha_s$ instead of $\alpha_c$.
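The selection of the speech-channel notch parameter reduces to a subtraction and a floor. A minimal sketch, with a hypothetical function name and toy energies, and with the reference parameter set equal to the averaged noise energy as reported for ACP noise:

```python
def speech_notch_parameter(r_y0, r_c0, alpha_c):
    """alpha_s(m) of (8.7): frame estimate (8.6), floored at alpha_c(m)."""
    alpha_hat = r_y0 - r_c0          # (8.6): frame energy minus averaged noise energy
    return max(alpha_hat, alpha_c)   # (8.7): never notch deeper than the reference


if __name__ == "__main__":
    r_c0 = 2.0                       # toy long term averaged noise energy
    alpha_c = r_c0                   # choice reported to work well for ACP noise
    for r_y0 in (1.5, 3.0, 12.0):    # quiet frame, weak speech, strong speech
        print(r_y0, speech_notch_parameter(r_y0, r_c0, alpha_c))
```

A frame whose energy falls below the noise average would give a negative estimate in (8.6); the floor keeps the filter of (8.5) well defined in that case.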

The output of the speech clutter filter was then used as the input to the high-pass and low-pass filters characterizing the unvoiced and voiced speech processing channels respectively. The filters were both 21-tap linear phase digital filters designed using the Parks-McClellan algorithm. The impulse responses and frequency characteristics are specified in the Appendix. No attempt was made to optimize the filter design. The outputs of the reference channel clutter filter $z(n)$ and the unvoiced and voiced speech filters $u(n)$, $v(n)$ are shown in Figures 6b, 6c, 6d. According to equation (6.2) the outputs of the speech filters were then correlated with the output of the reference channel clutter filter to form the detection statistics:

$$\ell_1(m) = \sum_{n=1}^{N} z(n)\,u(n) \tag{8.8a}$$

$$\ell_2(m) = \sum_{n=1}^{N} z(n)\,v(n) \tag{8.8b}$$

$$\ell_3(m) = \ell_1(m) - \ell_2(m) \tag{8.8c}$$

It should be noted that the comb filter has been left out of the voiced speech processing channel. This decision was made to show that good classifier performance could be obtained without having to make a pitch estimate, which simplifies the classifier processing, as is necessary for some applications.


The  detection  thresholds  were  obtained  by  driving  the  system 
with  ACP  noise  for  15  data  frames  (.3  sec).  This  is  the  only  training 
cycle  required  by  the  processor  and  should  be  relatively  easy  to  meet 
in  practice  because  there  is  always  a speech  free  interval  before  a 
talker  actually  speaks  into  the  encoding  device  after  having  turned  the 
machine  on.  Averaged  detection  statistics  for  the  training  noise  are 
computed  from 

$$\bar{\ell}_i(m) = (1-\gamma)\,\ell_i(m) + \gamma\,\bar{\ell}_i(m-1), \qquad i = 1,2,3 \tag{8.9}$$

with $\gamma = .95$ as before. The detection thresholds were then chosen to be

$$\lambda_i(m) = 1.5\,\bar{\ell}_i(m), \quad i = 1,2; \qquad \lambda_3(m) = \bar{\ell}_1(m) - \bar{\ell}_2(m) \tag{8.10}$$

which allows for moderate statistical fluctuations. After the first 15 data frames of noise have been processed ($m = 15$) and the initial threshold setting computed, the classification process is initiated.


The next frame of data is processed and the detection statistics $\ell_i(m+1)$
are computed. If $\ell_1(m+1) < \lambda_1(m)$ and $\ell_2(m+1) < \lambda_2(m)$ then the data are
classified as silence and the clutter correlation function, (8.2), and the
detection thresholds (8.9) and (8.10) are up-dated. If $\ell_1(m+1) > \lambda_1(m)$
or $\ell_2(m+1) > \lambda_2(m)$ then speech is declared present and neither the clutter
correlation  function  nor  the  detection  thresholds  are  changed.  No 
up-dating  is  done  until  the  next  frame  of  silence  is  detected.  This 
procedure  allows  the  classifier  to  track  noise  processes  whose  statistics 
vary  slowly  with  time.  Such  a classifier  structure  is  often  referred 
to  as  a decision-directed  detector  since  it  tells  itself  when  to  alter 
its  structure.  It  becomes  evident  therefore  that  the  detection  thresholds 
should  be  set  low  even  at  the  expense  of  a high  false  alarm  rate  (declaring 
noise  as  speech  is  a false  alarm).  It  would  be  a more  serious  error 
if  the  classifier  declared  speech  as  noise  since  then  all  the  clutter 
filters  and  detection  thresholds  would  be  tuned  to  reject  speech. 
Fortunately  this  malign  event  rarely  occurred  for  ACP  noise  and  when  it 
did  the  noise  always  completely  overpowered  the  speech  so  that  little 
change  in  the  filter  structures  occurred. 
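The decision-directed update logic can be summarised in a few lines. The sketch below covers only the silence versus speech decision and the conditional update; it assumes positive detection statistics, initialises the running average with the first training frame, and uses a unit-gain smoother. The class and method names are hypothetical.

```python
class DecisionDirectedDetector:
    """Silence/speech detector that adapts only on frames judged silent."""

    def __init__(self, gamma=0.95, margin=1.5):
        self.gamma = gamma
        self.margin = margin
        self.avg = None          # smoothed noise-only statistics (l1, l2)

    def _update(self, stats):
        if self.avg is None:     # assumption: start from the first frame
            self.avg = list(stats)
        else:
            self.avg = [(1.0 - self.gamma) * s + self.gamma * a
                        for s, a in zip(stats, self.avg)]

    def train(self, noise_stats):
        """Run the initial noise-only frames (15 frames in the experiments)."""
        for stats in noise_stats:
            self._update(stats)

    def thresholds(self):
        """Thresholds for l1 and l2: margin times the averaged statistics."""
        return [self.margin * a for a in self.avg]

    def classify(self, stats):
        """Return 'silence' or 'speech'; adapt only when silence is declared."""
        lam = self.thresholds()
        if all(s < t for s, t in zip(stats, lam)):
            self._update(stats)
            return "silence"
        return "speech"


if __name__ == "__main__":
    det = DecisionDirectedDetector()
    det.train([(1.0, 2.0)] * 15)
    print(det.classify((1.2, 2.5)), det.classify((5.0, 1.0)))
```

In a full system the clutter correlation function would be updated at the same point as the thresholds, so that a frame declared as speech leaves both untouched.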

The effects of the three filtering channels on the three speech types will be examined for some typical cases to develop a feeling for the classifier operation. Figure 6a is a plot of a 20 millisec input sample function of ACP noise. Figure 6f is the corresponding short-term power spectrum. Figure 6g is a plot of the adaptive clutter filter transfer function in the reference channel (the adaptive prefilter). For ACP noise input, it has adapted in such a way as to make a -10 dB null at the clutter frequencies. Figures 6b, 6c and 6d show the respective outputs of the reference channel, the high-pass filtered unvoiced speech channel and the low-pass filtered voiced speech channel.

As  was  described  in  the  previous  section  the  output  of  the  speech 
channel  clutter  filter  represents  a minimum  mean  squared  error  estimate 
of  the  input  speech.  Figure  6e  shows  a plot  of  the  prefilter  output 
in  response  to  ACP  noise  at  the  input.  Of  course,  with  high  probability 
the  classifier  will  classify  the  frame  as  silence,  hence  one  has  the 
option of setting the prefilter output to zero which removes the residual
noise  completely. 

Although  the  comb  filter  discriminator  was  not  used  in  the 

classifier  it  remains  of  interest  to  evaluate  the  robustness  of  the  maximum 

likelihood  pitch  estimator  in  ACP  noise.  This  was  done  by  applying 
the  output  of  the  low  pass  filter,  v(n) , to  a bank  of  two-pulse  comb 
filters covering the range from 70 to 300 Hz. Figure 6h is a plot of
the  energy  at  the  output  of  the  comb  filters  as  a function  of  the  pitch 
period  for  the  ACP  noise  sample. 

The same sequence of data are plotted in Figures 7 and 8 for 20 millisec frames of unvoiced and voiced speech respectively. Figures 7a and 7f show that the unvoiced speech-to-noise ratio is less than 0 dB (it is roughly -odB) yet Figure 7e shows that the prefilter has removed a significant portion of the clutter waveform while allowing the unvoiced speech waveform to pass relatively undisturbed. Figures 8a and 8f show that the voiced speech-to-noise ratio is quite large (it is roughly 9 dB). Figure 8g shows that the prefilter transfer function is adjusted to allow most of the speech to pass even though its spectrum overlaps that of the ACP noise. This shows the advantage of the adaptive prefilter. Had a fixed clutter filter been used, the voiced speech waveform would have been distorted unnecessarily. Figure 8h shows that the pitch estimate is perturbed very little by the presence of ACP noise. In general it was found that the only significant pitch errors were the effects of pitch doubling which occurred intermittently near the ends of a voiced sound. Figure 8e shows how the prefilter attempts to reproduce the voiced speech waveform.

Fig. 7. Processor response due to unvoiced speech. Panels: (a) unvoiced speech sample function; (b) reference channel signal; (c) unvoiced speech channel signal; (d) voiced speech channel signal; (e) prefilter output signal; (f) unvoiced speech power spectrum; (g) prefilter frequency response; (h) comb filter response versus pitch period.

Having established the basic characteristics of the classifier the next step is to evaluate the frame-to-frame performance when an ACP noise corrupted utterance is applied to the input. Classification errors were obtained by determining the true speech type by visually examining the waveform, power spectrum and comb filter energy contour for each 20 millisec sample function. Statistics were accumulated for a total of 3 utterances spoken by 3 male speakers in different ACP noise environments. The results are tabulated in Table 1. From these results the false alarm probability (declare speech given silence) is estimated to be 9.4%. The miss probability (declare silence given speech) is 2.3%. The misses mainly occurred for unvoiced speech that had been completely overpowered by the noise (■ K'dB SNR). Erroneous classifications (voiced ↔ unvoiced) occurred at the rate of 3%. Whenever a frame represented a mixture of voiced and unvoiced speech the classifier always chose in favour of voiced speech. This event could be reduced significantly by reducing the frame period (10 millisec versus 20 millisec). Although these statistics have been gathered for a relatively small ensemble, the general impression is that the performance is quite good.

Fig. 8. Processor response due to voiced speech.

TABLE 1

CLASSIFIER PERFORMANCE STATISTICS

                       ESTIMATED
TRUE          SILENCE   UNVOICED   VOICED
SILENCE         405        14        24
UNVOICED          4        43         2
VOICED            1         5       170

Another aspect of the experimental program was the recovery and synthesis of noise-corrupted speech using Linear Prediction techniques. The voiced-unvoiced decisions and the pitch estimates were derived using the methods described in this paper. The LPC filter coefficients were estimated from the prefilter output waveform. For the case of noise-corrupted unvoiced speech, Figure 7a for example, the prefilter output is shown in Figure 7e. Its short term power spectrum is shown in Figure 9 which, when compared with that for the input unvoiced speech plus ACP noise, Figure 7f, clearly demonstrates the action of the adaptive prefilter in eliminating the clutter. The LPC power spectrum estimate is also plotted on Figure 9 and shows that the synthetic speech is likely to reproduce the original unvoiced speech. Of course the ACP noise will cause the spectral estimate to be somewhat distorted but the perception of the additive ACP noise will have disappeared. It is for this reason that the synthetic speech is perceived to be "noise-free".

Similar results are obtained for the voiced speech sample function shown in Figure 8a. The short-term power spectrum of the prefilter output, Figure 8e, is plotted in Figure 10 and should be compared with the voiced speech plus noise power spectrum shown in Figure 8f. The corresponding LPC spectrum shown in Figure 10 shows the distortion in the first formant due to the presence of the ACP noise.

LPC  synthetic  speech  was  generated  for  a number  of  utterances 
recorded  in  ACP  noise.  Compared  to  LPC  speech  in  which  no  adaptive 
prefiltering was employed, an improvement in intelligibility
was  obtained. 

IX.  CONCLUSIONS 

Using statistical decision theory a new speech classification algorithm has been developed in the form of an estimator-correlator receiver. The structure is robust in the sense that it can adapt to time-varying noise fields in which the signal-to-noise ratio can be quite low (less than 10 dB). For noiseless speech the classifier simply involves two fixed filters and requires no pitch estimation or linear prediction analysis parameters. For noisy speech clutter filters must be added to the speech and reference channels. The reference clutter filter is developed on the basis of an initial .3 sec sample of noise data while the other adapts to the speech plus noise statistics calculated for each frame. If a frame is classified as noise, the reference channel filter is up-dated so that time varying noise statistics can be tracked.

The output of the speech channel clutter filter represents
an improved estimate of the input speech in the sense that much of
the additive noise has been cancelled from the signal. By applying
Linear Prediction techniques to this waveform, more intelligible
synthetic speech can be obtained.

A relatively thorough (non-real-time) evaluation of the classifier
and  adaptive  prefilter  was  conducted  for  Airborne  Command  Post  noise 
and  surprisingly  good  results  were  obtained.  Based  on  a limited  number 
of  listening  tests,  the  LPC  synthetic  speech  using  the  prefilter 
output  was  found  to  be  more  intelligible  than  the  LPC  synthesis 
of  the  original  noisy  speech. 

No  attempt  was  made  to  optimize  the  design  of  the  fixed 
voiced  (low  pass)  and  unvoiced  (high  pass)  filters.  In  this  study 
21-tap linear phase filters were used. A better approach would be to
obtain  long  term  statistics  for  voiced  and  unvoiced  speech  and  pick  the 
filter  length  and  passband  edges  to  more  closely  represent  the  average 
spectral  properties.  Another  useful  study  would  be  to  investigate  the 
possibility  of  using  recursive  filters  with  phase  compensation  to 
further  simplify  the  processing. 

Although  a first  order  attempt  was  made  to  improve  the  design 
of  the  clutter  filters,  other  methods  are  undoubtedly  possible. 

Additional  insights  are  also  needed  in  the  selection  of  the  clutter 
filter  design  parameter;  in  this  note  trial  and  error  was  used  to  make 
the  selection. 

Of  course,  the  real  test  of  any  speech  processing  algorithm 
is  obtained  in  a real-time  environment.  This  is  the  focus  of  the  current 
effort.
APPENDIX


The unvoiced and voiced speech Wiener filters were approximated by 21-tap linear phase high and low pass filters designed using the Parks-McClellan algorithm. The impulse responses used in the experimental program are given in Table 2 (h(n) = h(-n)). The magnitudes of the frequency responses are shown in Figures 11 and 12.


TABLE 2

IMPULSE RESPONSE    UNVOICED FILTER      VOICED FILTER
h(1)                -0.215110t>7E-01     -0.38655568E-02
h(2)                 0.55939741E-02      -0.32053679E-01
h(3)                 0.21661893E-01       0.23418449E-01
h(4)                 0.39310634E-01       0.13665602E-01
h(5)                 0.45899481E-01      -0.42199165E-01
h(6)                 0.29383000E-01       0.73566064E-02
h(7)                -0.15331455E-01       0.66053927E-01
h(8)                -0.82191288E-01      -0.65457523E-01
h(9)                -0.15448785E+00      -0.84543467E-01
h(10)               -0.2103.S391E+00      0.30347985E+00
h(11)                0.76869851E+00       0.59147523E+00
